Brain metastasis is one of the leading causes of brain cancer. This experiment investigates transcriptomic changes based on knocking out lipid metabolism genes in various cell lines using Crispr Cas9. The profiling was done on 4 main cell lines - MDAMB231, HCC1806, HCC1954, and JIMT1. The following GEO dataset was chosen in Assignment 1 by going through various published papers - https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE148372 which provied the counts data based on the expression profiling done through high throughput sequencing.
Our experiment will look into how each cell line differently reacted to the knocking out of metabolism genes and therefore we will be sampling based on cell_lines. This will be done in a pool of 40 divided equally into the 4 kinds of cell lines.
Assignment 1 has handled most of the data cleaning and provides us with the mormalized data counts for each of these cell lines which will be used for the rest of this paper. The data was normalized using the TMM method and showed that some cell lines cluster together in the MDS plot showing higher coherence. Some visualizations from the previous findings -
### MDS plot to see relations between counts of different cell lines in the experiment
knitr::include_graphics("data/MDS_A1.png")
Figure 1. Multidimensional Scaling (MDS) plot with age group coloring, with the clustering of samples based on gene expression data.
### Gene representation for each cell line
knitr::include_graphics("data/GeneRep_A1.png")
Figure 1. Multidimensional Scaling (MDS) plot with age group coloring, with the clustering of samples based on gene expression data.
samples <- data.frame(ID = 1:40, cell_line = c("HCC1806", "HCC1954", "JIMT1", "MDAMB231", "HCC1806", "HCC1954", "JIMT1", "MDAMB231", "HCC1806", "HCC1954", "JIMT1", "MDAMB231", "HCC1806", "HCC1954", "JIMT1", "MDAMB231", "HCC1806", "HCC1954", "JIMT1", "MDAMB231", "HCC1806", "HCC1954", "JIMT1", "MDAMB231", "HCC1806", "HCC1954", "JIMT1", "MDAMB231", "HCC1806", "HCC1954", "JIMT1", "MDAMB231", "HCC1806", "HCC1954", "JIMT1", "MDAMB231", "HCC1806", "HCC1954", "JIMT1", "MDAMB231"))
head(samples)
knitr::include_graphics("data/MDS_A1.png")
* Since the data has been
normalized, there have not been any outliers from the clusters and the
cell lines are well represented in the data. * We can now use the
cell_line attribute to calculate the differential data.
samples$cell_line <- relevel(as.factor(samples$cell_line), ref = "HCC1806")
design_matrix <- model.matrix(~ samples$cell_line)
head(design_matrix)
## (Intercept) samples$cell_lineHCC1954 samples$cell_lineJIMT1
## 1 1 0 0
## 2 1 1 0
## 3 1 0 1
## 4 1 0 0
## 5 1 0 0
## 6 1 1 0
## samples$cell_lineMDAMB231
## 1 0
## 2 0
## 3 0
## 4 1
## 5 0
## 6 0
counts <- normalized_count_data[,2: ncol(normalized_count_data)-1]
counts
dge <- DGEList(counts = counts, group = unique(samples$cell_line))
dge <- calcNormFactors(dge)
disp <- estimateDisp(dge, design_matrix)
## Warning in estimateDisp.default(y = y$counts, design = design, group = group, :
## No residual df: setting dispersion to NA
plotMeanVar(disp,
show.raw.vars = TRUE,
NBline=TRUE,
ylim=c(0, 60),
main = "Mean-Variance Plot showing variance of cell lines from control")
## Warning in xy.coords(x, y, xlabel, ylabel, log): 2921 x values <= 0 omitted from
## logarithmic plot
## Warning in plot.window(...): nonfinite axis=2 limits [GScale(-inf,1.77815,..);
## log=TRUE] -- corrected now
# Calculating P-valiues
# get expression mtrx
expression <- as.matrix(log2(counts))
rownames(expression) <- normalized_count_data$gene
colnames(expression) <- colnames(counts)
# get minimal set and fit using lm
minimal_set <- ExpressionSet(assayData=expression)
fit <- lmFit(minimal_set, design_matrix[1:4])
The threshold used was 0.05 as it gives us a good confidence interval for the p-values and is generally used as a rule of thumb.
# Using Bergamini method to adjust values and fit bayesian trend
updated_fit <- eBayes(fit, trend=TRUE)
hits <- topTable(updated_fit,
coef=ncol(design_matrix[1:4]),
adjust.method = "BH", # using bergamini
number = nrow(expression))
outputs <- merge(normalized_count_data[1],
hits,
by.y=0,by.x=1,
all.y=TRUE)
# ranking the values on p-value
outputs <- outputs[order(outputs$P.Value),]
head(outputs)
(p <- length(which(outputs$P.Value < 0.05)))
## [1] 16789
(adj_p <- length(which(outputs$adj.P.Val < 0.05)))
## [1] 16485
The bergamini method was used for hypothesis correction. It is clear that the adjuted outputs are lower than the values before adjusting and not as many passed the correction.
In this section we can use a volcano plot to highlight the differencially expressed genes and show genes of interest.
volcano <- data.frame(
"gene" = outputs$HCC1806,
"logFC" = outputs$logFC,
"adj.P.Val" = outputs$adj.P.Val
)
EnhancedVolcano(volcano,
lab = volcano$gene,
x = 'logFC',
y = 'adj.P.Val',
pCutoff = 0.05,
title = "Volcano Plot with threshold cutoff of 0.05",
ylim = c(0,10)
)
top_hits <- outputs$HCC1806[outputs$adj.P.Val < 0.05 && abs(outputs$logFC) > 1]
## Warning in outputs$adj.P.Val < 0.05 && abs(outputs$logFC) > 1: 'length(x) =
## 23686 > 1' in coercion to 'logical(1)'
## Warning in outputs$adj.P.Val < 0.05 && abs(outputs$logFC) > 1: 'length(x) =
## 23686 > 1' in coercion to 'logical(1)'
heatmap <- normalized_count_data[which(normalized_count_data$gene %in% top_hits),][1:4]
heatmap <- na.omit(t(scale(t(heatmap)))[1:50,])
heatmap <- Heatmap(as.matrix(heatmap),
show_row_names = FALSE)
heatmap
We will choose to run an enrichment analysis to compare for over representation of genes *Find the top up regulated genes
top_up <- outputs$HCC1806[outputs$logFC > 1 && outputs$adj.P.Val < 0.05]
## Warning in outputs$logFC > 1 && outputs$adj.P.Val < 0.05: 'length(x) = 23686 >
## 1' in coercion to 'logical(1)'
## Warning in outputs$logFC > 1 && outputs$adj.P.Val < 0.05: 'length(x) = 23686 >
## 1' in coercion to 'logical(1)'
query_res <- gost(query = top_up, organism = "hsapiens")
query_res$result$query <- "up regulated"
gostplot(query_res)
We use GO dataset for annotation as it is the one that has the most hits with our outputs and is one of the most commonly used datasets.
res <- query_res$result[query_res$result$source == "GO:BP", ]
## Use a dot plot to show visualization of enrichment analysis
ggplot(res[1:20,]) +
geom_point(aes(
x = precision,
color = p_value,
y = term_name,
size = intersection_size)) +
labs(title = "Enrichment analysis on up regulated genes from dataset")
### Down-regulated genes
top_down <- outputs$HCC1806[outputs$logFC < -1]
query_res <- gost(query = top_down,
organism = "hsapiens")
gostplot(query_res)
Again, we use GO:BP to compare with it being most representative and plot the dot plot using ggplot
res_down <- query_res$result[query_res$result$source == "GO:BP", ]
ggplot(res_down[1:20,]) +
geom_point(aes(
x = precision,
color = p_value,
y = term_name,
size = intersection_size)) +
labs(title = "Enrichment analysis on down regulated genes from dataset")
The over-representation results seen here do support the conclusions in the original paper that the knocking out metabolic genes from cell lines can affect different cell lines in different ways. There were similarities between the control cell line used in the experiment and the HCC1954 cell line while the other two were farther away on the MDS plots. As seen in our analysis the different cell lines are differencially expressed with some being clustered together.
I have not been able to find any similar experiments to support the results seen here.